April 8, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
All the methods we’ll talk about support sampling!
What makes a model generative?
We should be able to provide an answer to the question:
\[ P(\mathbf x | \boldsymbol \Theta) = ? \]
for any viable \(\mathbf x\).
A generative model is one where we can answer questions about the structure of \(\mathbf x\)
Frequently under assumptions, but we can still answer it!
PCA:
\[ \mathbf z = \mathbf W^T \mathbf x \]
\[ \mathbf x = \mathbf W \mathbf z \]
where \(\mathbf W\) is a \(P \times K\) weight matrix with \(K \ll P\).
Optimal solution under squared reconstruction error is \(\mathbf W = \mathbf Q_K\), where \(\mathbf Q_K\) contains the first \(K\) eigenvectors of the (centered) covariance matrix
\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^T \]
with \(\mathbf D\) being a diagonal matrix with the eigenvalues sorted from largest to smallest.
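A minimal numpy sketch of this eigendecomposition route (the low-rank synthetic data and K = 2 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 500, 10, 2

# Low-rank data plus a little noise, so the first K components dominate
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P)) + 0.1 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)  # center

# Eigendecomposition of the covariance matrix (symmetric, so Q is orthogonal)
evals, Q = np.linalg.eigh(X.T @ X / N)
order = np.argsort(evals)[::-1]   # sort eigenvalues largest -> smallest
W = Q[:, order[:K]]               # first K eigenvectors: a P x K weight matrix

Z = X @ W          # encode: z = W^T x (rows of X are x^T)
X_hat = Z @ W.T    # decode: x = W z

mse = np.mean((X - X_hat) ** 2)
print(mse)
```

Keeping the top-\(K\) eigenvectors leaves only the noise outside the retained subspace, so the reconstruction error is small; using the bottom-\(K\) eigenvectors instead would discard nearly all the signal.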
Generative goal:
\[ P(\mathbf x | \mathbf W) = \int P(\mathbf x | \mathbf W , \mathbf z) P(\mathbf z) d \mathbf z \]
For PCA, all we know is that \(\mathbf x = \mathbf W \mathbf z\) and that \(\mathbf z = \mathbf W^T \mathbf x\)
DOES NOT COMPUTE!!!
This is referred to as a deterministic bottleneck autoencoder
Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!
Restriction: each input instance passes through a low-dimensional bottleneck ( \(K \ll P\) )
Can’t just learn the identity map, so it needs to set up \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible
Note that PCA is a special case!
No good reason to restrict ourselves to linear maps when we have autodiff that can fit arbitrarily complex models
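As a sketch of what autodiff buys us, here is a minimal nonlinear bottleneck autoencoder in PyTorch (the layer sizes, synthetic data, and training settings are illustrative, not prescriptive):

```python
import torch
from torch import nn

torch.manual_seed(0)
N, P, K = 256, 10, 2
X = torch.randn(N, K) @ torch.randn(K, P)  # data living on a 2-d subspace of 10-d space

# Nonlinear bottleneck autoencoder: P -> K -> P with K << P
encoder = nn.Sequential(nn.Linear(P, 8), nn.ReLU(), nn.Linear(8, K))
decoder = nn.Sequential(nn.Linear(K, 8), nn.ReLU(), nn.Linear(8, P))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(500):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()  # squared reconstruction error
    loss.backward()
    opt.step()
```

With linear layers and no activations this reduces to (an unorthogonalized version of) PCA; the ReLU layers are what take us beyond the linear special case.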
Generative goal:
\[ P(\mathbf x | \Theta) = \int P(\mathbf x | f(\mathbf z), \Theta) P(\mathbf z) d \mathbf z \]
For a deterministic autoencoder, all we learn is \(f(\mathbf z)\) through our neural network backbone!
DOES NOT COMPUTE!!!
Solution: Assumptions. For centered \(\mathbf x\):
\[ P(\mathbf x | \mathbf z) = \mathcal N_P(\mathbf x | f_\mu(\mathbf z), f_\sigma(\mathbf z)) \]
Each \(\mathbf x\) is a random draw from a multivariate normal distribution with moments that are a function of the latent variable
Similar to PCA, just saying that we have some uncertainty in the mapping of \(\mathbf z \rightarrow \mathbf x\)
\[ P(\mathbf z) = \mathcal N_K(\mathbf z | 0 , \mathcal I_K) \]
Prior to seeing any data, we believe that each \(\mathbf z\) is a random draw from a standard multivariate normal
Inconsequential choice since we’re going to learn \(\mathbf z\) anyway
\(\mathbf z\) is latent, so we make the structure!
Made a little simpler/general:
\[ P(\mathbf x | \mathbf z) = \mathcal P(\mathbf x | f(\mathbf z, \boldsymbol \Theta)) \]
\(f(\mathbf z)\) is some mapping of the latent variable to the original feature space
\(\boldsymbol \Theta\) is a set of parameters that dictate the mapping
Our goal: Find values for \(\boldsymbol \Theta\) that maximize the likelihood with which we would observe our input features given the parameters.
\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \prod \limits_{i = 1}^N P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) \]
But \(\mathbf x\) depends on the values of the latent variables, \(\mathbf z\), so:
\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \prod \limits_{i = 1}^N \int P (\mathbf x_i | \mathbf z_i , \boldsymbol \Theta) P(\mathbf z_i) d\mathbf z_i \]
Previously, we’ve seen that MLE problems can be made simpler by optimizing the log-likelihood
\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \sum \limits_{i = 1}^N \log \int P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) P(\mathbf z_i) d\mathbf z_i \]
Unfortunately, we can’t push the log inside the integral!
Finding the derivative of a log-integral is not particularly easy…
Maximizing this expression is tricky!
What’s holding us back here is that we need to learn \(\mathbf z\) and integrate over it
It would be way easier if we knew \(\mathbf z\) beforehand
But, \(\mathbf z\) is latent, so we learn it as a function of the data!
The Gaussian factor model has many different solution methods:
Eigendecomposition (it’s equivalent to PCA with a light twist)
MLE via an expectation-maximization routine
Bayesian MAP estimation using Gibbs sampling
We’re going to show a different method that will be extendable…
Goal:
Find \(\boldsymbol \Theta\) such that we maximize the likelihood that we see our data:
\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \sum \limits_{i = 1}^N \log P(\mathbf x_i | \boldsymbol \Theta) \]
Let’s start with Bayes rule:
\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)}{P(\mathbf x_i | \boldsymbol \Theta)} \]
and rearrange to get the quantity that we want:
\[ P(\mathbf x_i | \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} = \frac{P(\mathbf x_i , \mathbf z_i | \Theta)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} \]
Things we know/define:
\[ P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) \text{ (the likelihood)} \text{ ; } P(\mathbf z_i) \text{ (the prior)} \]
Things we don’t know:
\[ P(\mathbf x_i | \boldsymbol \Theta) \text{ (the marginal likelihood)} \text{ ; } P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \text{ (the posterior over the latent } \mathbf z \text{)} \]
Another way to view these terms:
\[ P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta) \]
is a decoder - translate a latent vector of length \(K\) to the input space
\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]
is an encoder - translate an input to the latent space
The encoder and decoder are probabilistic!
Input goes in and maps to a distribution in the latent space
Latent value is a distribution and maps out to another distribution in the input space
Can’t do anything with this until we figure out how to find \(\boldsymbol \Theta\) that maximizes the marginal likelihood
Right now, our sticking point is the encoder:
\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]
We don’t know what this is!
We only have a prior on \(\mathbf z\)
Bayesian approaches to simplifying the posterior:
Analytical solution: Works in the base case, but not extendable to the autoencoder case
MAP approximation: Again, works in the base case, but MVNs are too simple for complicated structures
Third method: find an approximate posterior that is as close as possible to the true posterior
Solution: Come up with an approximate distribution that can closely learn the form of \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)\)
\[ Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi) \approx P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]
\(Q()\) should be a distribution that is easy to work with like the multivariate normal distribution.
We’ll see why this works in a second
Way more flexible than you might think
Multiply/divide by \(Q()\) (because we can):
\[ P(\mathbf x_i | \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)} \]
Take the log of both sides (and rearrange in a special way):
\[ \log P(\mathbf x_i | \boldsymbol \Theta) = \log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta) - \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)} + \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} \]
Finally, note that:
\[ E_Q[\log P(\mathbf x_i | \boldsymbol \Theta)] = \log P(\mathbf x_i | \boldsymbol \Theta) \]
since the marginal likelihood doesn’t depend on \(Q\)
Applying this expectation across our quantity:
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - E_Q\left[\log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)}\right] + E_Q \left[ \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)}\right] \]
Goal - maximize this quantity:
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - E_Q\left[\log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)}\right] + E_Q \left[ \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)}\right] \]
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] \]
is a measure of the expected reconstruction error w.r.t. our approximation
Assuming \(P()\) is a normal distribution like in factor analysis:
\[ \propto \exp\left[-\frac{1}{2} (\mathbf x_i - f(\mathbf z_i; \Theta))^T \boldsymbol \Psi^{-1} (\mathbf x_i - f(\mathbf z_i; \Theta)) \right] \]
we’re getting a squared difference between the input and the reconstructed input
The maximum of the quantity is achieved when the reconstruction error is lowest!
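A quick numpy check of this claim: the Gaussian log-density differs from the negative \(\boldsymbol \Psi\)-weighted squared error only by a constant that does not involve \(f(\mathbf z)\), so ranking candidate reconstructions by likelihood is identical to ranking them by reconstruction error. The candidates below are arbitrary random vectors standing in for different \(f(\mathbf z_i)\):

```python
import numpy as np

rng = np.random.default_rng(1)
P = 4
Psi = np.diag(rng.uniform(0.5, 2.0, size=P))  # diagonal noise covariance
Psi_inv = np.linalg.inv(Psi)
x = rng.normal(size=P)

def log_lik(mean):
    """Multivariate normal log-density of x at the given mean, with covariance Psi."""
    d = x - mean
    return -0.5 * (P * np.log(2 * np.pi) + np.log(np.linalg.det(Psi)) + d @ Psi_inv @ d)

# Candidate reconstructions f(z): the likelihood ranking matches the
# (negative) weighted-squared-error ranking
candidates = [rng.normal(size=P) for _ in range(5)]
lls = [log_lik(f_z) for f_z in candidates]
errs = [(x - f_z) @ Psi_inv @ (x - f_z) for f_z in candidates]
print(int(np.argmax(lls)), int(np.argmin(errs)))
```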
The second quantity is a special one called the KL Divergence
\[ D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) = \int Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi) \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)} d\mathbf z_i \]
The KL divergence is a measure of “distance” between two distributions
Always greater than or equal to zero (see Gibbs’ inequality)
Only zero when \(Q = P\) for all values of \(\mathbf z_i\)
In general, the KL divergence is really hard to compute
In the special case where \(Q\) and \(P\) are \(K\) dimensional multivariate normal distributions, there is a closed form expression for the KL divergence:
\[ Q \sim \mathcal N(\boldsymbol \mu_0 , \boldsymbol \Sigma_0) \text{ ; } P \sim \mathcal N(\boldsymbol \mu_1 , \boldsymbol \Sigma_1) \]
\[ D_{KL}(Q || P) = \frac{1}{2} \left( \text{tr}\left(\boldsymbol \Sigma^{-1}_1 \boldsymbol \Sigma_0 \right) - K + (\boldsymbol \mu_1 - \boldsymbol \mu_0)^T \boldsymbol \Sigma^{-1}_1 (\boldsymbol \mu_1 - \boldsymbol \mu_0) + \log \left(\frac{\det \boldsymbol \Sigma_1}{\det \boldsymbol \Sigma_0}\right) \right) \]
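The closed form translates directly to code. A small numpy sketch (with an arbitrary random covariance) that also checks two of the properties above: zero when \(Q = P\), and nonnegative in general:

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """Closed-form KL(Q || P) for K-dimensional Gaussians Q = N(mu0, S0), P = N(mu1, S1)."""
    K = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - K + d @ S1_inv @ d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rng = np.random.default_rng(2)
K = 3
A = rng.normal(size=(K, K))
S0 = A @ A.T + np.eye(K)  # a random positive-definite covariance for Q
mu0 = rng.normal(size=K)

print(kl_mvn(mu0, S0, mu0, S0))                  # KL(Q || Q) = 0
print(kl_mvn(mu0, S0, np.zeros(K), np.eye(K)))   # KL against a standard normal: >= 0
```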
Let’s look at a figure in the notebook.
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]
Our goal is to maximize this quantity
Maximize the first term
Since KL is always nonnegative, minimize the second term
Minimize the difference between the proposal and the prior!
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]
The second KL divergence is the distance between the proposal and the unknown conditional
We don’t know what \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)\) is…
If our proposal is the same as the conditional, then the KL is zero
These two KL terms together control the gap between the prior and the true posterior
Unfortunately, we don’t know this term!
The evidence lower bound (ELBo):
\[ \log P(\mathbf x_i | \boldsymbol \Theta) \ge E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
Optimize this quantity which is hopefully close to the true value!
In practice, if the resulting posterior is approximately normal, then the approximation is pretty good!
The Bernstein–von Mises theorem (a Bayesian CLT) states that, under regularity conditions, posteriors converge to multivariate normals as \(N \to \infty\)
This generic distributional optimization strategy is called variational inference
In a second, we’re going to see an alteration of this method called amortized variational inference
Don’t learn different distributions - just learn a mapping of the input to the moments of the variational distributions!
Way faster.
For the generic latent variable model:
\[ P(\mathbf z) \sim \mathcal N_K(\mathbf 0 , \mathcal I_K) \]
\[ P(\mathbf x | \mathbf z) \sim \mathcal N_P(f(\mathbf z), \boldsymbol \Sigma_x) \]
\[ Q(\mathbf z | \mathbf x) \sim \mathcal N_K(g(\mathbf x), \boldsymbol \Sigma_z) \]
Find values for parameters that minimize the negative variational lower bound:
\[ -E_Q[\log P(\mathbf x | \mathbf z )] + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
Factor analysis is a special case of this model where we actually know \(P(\mathbf z_i | \mathbf x_i)\):
\[ P(\mathbf z) \sim \mathcal N_K(\mathbf 0 , \mathcal I_K) \]
\[ P(\mathbf x | \mathbf z) \sim \mathcal N_P(\mathbf W \mathbf z, \boldsymbol \Sigma_x) \]
\[ Q(\mathbf z | \mathbf x) \sim \mathcal N_K(\mathbf W^T (\mathbf W \mathbf W^T + \boldsymbol \Psi)^{-1}\mathbf x, \mathcal I_K - \mathbf W^T(\mathbf W \mathbf W^T + \boldsymbol \Psi)^{-1} \mathbf W) \]
Note that this construction is pretty arbitrary about the mapping between \(\mathbf x\) and the mean and covariance for \(Q()\) and \(\mathbf z\) to \(P()\)
Like deterministic autoencoders, we can replace the linear mapping with an arbitrary function learned via a deep model
The in-between parts can be FCNN and/or CNN backbones!
No different than deterministic autoencoders
Called amortized variational inference since we’re not learning posteriors, per se
Just learning functions
Main difference:
The generic routine:
Specify a decoder likelihood w.r.t. the input - \(P(\mathbf x | f(\mathbf z))\)
Specify a prior on the latent variables in \(K\) dimensions - \(P(\mathbf z)\)
Specify an approximate posterior over the latents - \(Q(\mathbf z | g(\mathbf x))\)
Learn \(g()\) and \(f()\) that maximize the ELBo:
\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]
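The four steps above can be sketched as a minimal PyTorch VAE. This is a hedged sketch, not the notebook example: the sizes and data are illustrative, the decoder is a fixed unit-variance Gaussian (so the reconstruction term is a squared error), and the draw from \(Q\) uses the standard reparameterization \(\mathbf z = \boldsymbol \mu + \boldsymbol \sigma \odot \boldsymbol \epsilon\) so gradients flow through the sample:

```python
import torch
from torch import nn

torch.manual_seed(0)
P, K = 10, 2
X = torch.randn(256, K) @ torch.randn(K, P)  # illustrative data

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(P, 2 * K)  # g(x): mean and log-variance of Q(z | x)
        self.dec = nn.Linear(K, P)      # f(z): mean of P(x | z)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        # reparameterized draw from Q so gradients flow through the sample
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_hat, mu, logvar):
    # -E_Q[log P(x | z)] up to an additive constant (unit-variance Gaussian decoder)
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=-1)
    # closed-form KL(Q(z | x) || N(0, I)) for a diagonal Gaussian Q
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1 - logvar).sum(dim=-1)
    return (recon + kl).mean()

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
with torch.no_grad():
    loss_start = neg_elbo(X, *model(X)).item()
for _ in range(300):
    opt.zero_grad()
    loss = neg_elbo(X, *model(X))
    loss.backward()
    opt.step()
```

Swapping the two `nn.Linear` maps for FCNN or CNN backbones gives the general amortized setup without changing the loss.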
Why is this helpful?
Let’s draw a picture on the board.
Main point:
We’re creating a mapping from a point input to a distribution in the latent space
Mapping the latent distribution to another distribution over reconstruction candidates
Fills in the gaps in a way that deterministic autoencoders do not!
Some practical considerations:
You’ll often just skip the second distributional draw and allow all uncertainty to be propagated upwards from the latent distribution. Doesn’t make too much of a difference in the final model.
We’ll almost always want to restrict the covariance matrix for the mapping of \(\mathbf x\) to \(\mathbf z\) to be diagonal. This makes the latent space orthogonal and allows easy separation of sources of variation
You’ll almost always want to make \(Q(\mathbf z | \mathbf x)\) and \(P(\mathbf x | \mathbf z)\) multivariate normal distributions. Similarly, you’ll want to make your prior over the latent space multivariate normal with mean 0 and identity covariance. This makes the computations of KL divergence needed tractable.
With a standard normal prior in \(K\) dimensions and a diagonal normal proposal in \(K\) dimensions, there is a simple form for the KL divergence
\[ D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) = \frac{1}{2} \sum \limits_{k = 1}^K \left[\sigma^2_k + \mu^2_k - 1 - \log(\sigma^2_k)\right] \]
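A quick numpy sanity check (with arbitrary \(\boldsymbol \mu\) and \(\boldsymbol \sigma^2\)) that this simple form agrees with the general multivariate normal KL formula specialized to \(\boldsymbol \Sigma_0 = \text{diag}(\boldsymbol \sigma^2)\), \(\boldsymbol \Sigma_1 = \mathcal I_K\), \(\boldsymbol \mu_1 = \mathbf 0\):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
mu = rng.normal(size=K)
sigma2 = rng.uniform(0.2, 3.0, size=K)  # diagonal variances of Q

# Simple per-dimension form for KL(Q || N(0, I))
kl_simple = 0.5 * np.sum(sigma2 + mu ** 2 - 1 - np.log(sigma2))

# General MVN formula with Sigma_0 = diag(sigma2), Sigma_1 = I, mu_1 = 0
S0 = np.diag(sigma2)
kl_general = 0.5 * (np.trace(S0) - K + mu @ mu + np.log(1.0 / np.linalg.det(S0)))

print(kl_simple, kl_general)
```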
Let’s go through an example VAE in PyTorch.
VAEs produce way more coherent generated images than other latent variable methods!
Next time, we’ll briefly touch on two things with VAEs
Editing images
\(\beta\)-VAEs
Then, we’ll start our discussion of normalizing flow models